(57) The user interface in an automatic speech 
recognition (ASR) system is dynamically con- 
trolled, based upon the level of confidence in 
the results of the ASR process. In one embodi- 
ment, the system is arranged to distinguish 
error prone ASR interpretations from those 
likely to be correct, using a degree of confi- 
dence in the output of the ASR system deter- 
mined as a function of the difference between 
the confidence in the "first choice" selected by 
the ASR system and the confidence in the 
"second choice" selected by the ASR system. 
In this embodiment, the user interface is ar- 
ranged so that the explicit verification steps 
taken by the system as a result of uncertain 
information are different from the action taken 
when the confidence is high. In addition, diffe- 
rent treatment can be provided based upon the 
"consequences" of misinterpretation as well as 
the historical performance of the system with 
respect to the specific user whose speech is 
being processed. In another embodiment, after 
an ASR system interprets an utterance, the 
confidence in the interpretation is assessed, 
and three different interactions with the user 
may then occur. 
Field of the Invention 

This invention relates to Automatic Speech Rec- 
ognition (ASR), and in particular, to the user interface 
process provided in a system using automatic speech 
recognition wherein a confidence measure of the 
ASR interpretation of an individual's speech input is 
computed and then used to selectively alter the treat- 
ment afforded to that individual. 

Background of the Invention 

Automatic Speech Recognition (ASR) systems 
have begun to gain wide acceptance in a variety of ap- 
plications. U.S. Patent 4,827,500 issued to Binkerd et 
al. on May 2, 1989 describes a technique for Automatic 
Speech Recognition to Select Among Call Destinations 
in which a caller interacts with a voice response 
unit having an ASR capability. Such systems 
either request a verbal input or present the user with 
a menu of choices, then wait for a verbal response, in- 
terpret the response using ASR, and carry out the re- 
quested action, all without human intervention. 

An important issue in designing the user interface 
to a system using ASR is the handling of potential 
recognition errors, since whenever an ASR system 
interprets an utterance, there is some residual 
uncertainty as to the correspondence between the 
utterance and the interpretation. This problem is 
especially important for input of digit strings, such 
as in a system in which telephone numbers or credit 
card numbers are spoken by 
the caller, because it is not uncommon to have an 
overall accuracy rate of only 85 to 90 percent for a dig- 
it string (and, in some cases, even for a segment of a 
digit string). To deal with potential errors, systems to- 
day use some type of explicit verification for all trans- 
actions where the error rate causes concern in order 
to avoid the possibility of processing an incorrect digit 
string. For example, following input of each connect- 
ed digit string, the ASR system may "read back" the 
best digit string candidate, and require an affirmative 
or negative response from the individual using the 
system. An example would be: "Please say 'yes' if 
your credit card number is XXX-YYYY, and please 
say 'no' otherwise". While this type of explicit verifi- 
cation is often necessary and useful, it is cumber- 
some, time consuming and annoying, especially for 
frequent users of the ASR system, or users for whom 
the ASR system has a high degree of confidence. 
Other systems have requested a user to re-input a 
speech request if a previous request could not be rec- 
ognized. However, when recognition does occur, a 
static verification process is employed. 

Summary of the Invention 

In accordance with the present invention, the 
user interface in a system that uses automatic speech 
recognition (ASR) technology is arranged to provide 
a dynamic process wherein different treatment is 
given to a user, based upon the level of confidence in 
the results of the ASR process. In one embodiment, 
the system is arranged to distinguish error prone ASR 
interpretations of an utterance from those likely to be 
correct, using a level or degree of confidence in the 
output of the ASR system. The confidence can be 
determined as a function of the difference between the 
proximity scores (defined below) for the first and 
second choices selected by the ASR system. In this 
embodiment, the user interface is arranged so that the 
explicit verification steps taken by the system when 
confidence is relatively lower are different from the 
action taken when the confidence is high. In addition, 
different treatment can be provided based upon the 
"consequences" of misinterpretation as well as the 
historical performance of the system with respect to 
the specific user whose speech is being processed. 
In another embodiment of the invention, after an ASR 
system interprets an utterance, the confidence in the 
interpretation is assessed, and three different 
interactions with the user may then occur. 

Illustratively, where the ASR system is used to 
recognize numerical digits, the confidence in an 
interpretation can be determined by assigning a 
proximity score between each uttered digit and each 
digit model for which the ASR system has been trained, 
where a large score indicates good correspondence. 
Thus, a vector is created for each utterance that 
indicates the proximity of that utterance to each 
model. A high confidence is said to exist when the 
proximity score for the model with the closest 
proximity is much larger than the proximity score for 
the next best choice. This, in essence, means that the 
interpretation is much better than any alternative. 

By mapping the confidence or "certainty level" of 
the results of the ASR system performance into 
several different action alternatives that are 
determined by detailed analysis of the consequence of 
making an error and the difficulty for the user of 
responding to a verification request and/or 
re-entering the information, the user interface to the 
system is considerably improved, and a user is only 
required to re-enter or verify a speech input when 
such action makes sense. 

Brief Description of the Drawings 

The present invention will be more fully 
appreciated by consideration of the following detailed 
description, which should be read in light of the 
accompanying drawing in which: 

Fig. 1 is a flow diagram illustrating the steps 
followed in a conventional ASR system when a person 
dials a telephone number with voice input; 
Figs. 2 and 3 together are a flow diagram 
illustrating the steps followed in an ASR system 
arranged in accordance with the present invention, for 
responding to a person dialing a telephone number 
with voice input; 
Fig. 4 is a block diagram illustrating one 
arrangement for a voice processing unit arranged to 
implement a dynamic user interface process, such 
as the process described in Figs. 2 and 3; 
Fig. 5 is a flow diagram illustrating the steps 
followed in an ASR system arranged in accordance 
with the present invention, in which three different 
outcomes result from ASR processing that 
yields three possible confidence levels. 

Detailed Description 

Referring first to Fig. 1, there is shown a flow di- 
agram illustrating the steps followed in a conventional 
ASR system. In this example, a person dials a tele- 
phone number with voice input, and the ASR system 
interprets the person's utterance and takes action, 
such as completing a telephone call, in response to 
the interpretation of the utterance obtained from the 
ASR system. More specifically, a transaction involv- 
ing the dialing of a 10 digit telephone number having 
a three digit area code followed by a seven digit local 
telephone number, is described. 

The process of Fig. 1 begins in step 101, when a 
caller is connected to a speech processing platform, 
described below in connection with Fig. 4. The plat- 
form is arranged to provide audible prompts, to re- 
ceive speech inputs, and to interpret the speech using 
ASR techniques. In step 103, the user is prompted by 
an audible announcement to enter the area code for 
the telephone call, by speaking the three digits in step 
105. In step 106, any well known automatic speech 
recognition process is performed, and a determina- 
tion is made of the digits spoken by the caller. In gen- 
eral, the interpretation performed by the ASR process 
typically involves comparison of the user inputted 
speech with stored speech samples. However, the 
ASR system can be arranged to implement any of 
several well known speech recognition processes. 

After the three digits of the area code are recog- 
nized in step 106, the system, in step 107, requests 
the caller to explicitly verify that the recognized digits 
are, in fact, the same as the digits the user spoke in 
step 105. The user then responds with a "yes" or "no" 
answer in step 108, and the system takes different ac- 
tion in branching step 111, depending upon the re- 
sponse. In particular, if the answer received in step 
108 is "yes", indicating that the first three digits were 
correctly recognized, the process continues with step 
113, in which the user is prompted for the remaining 
7 digits of the telephone number. The user speaks 
these seven digits in step 115, and, in step 116, a de- 
termination is made of the digits spoken by the caller, 
again using the ASR process as in step 106. Next, in 
step 117, the caller is requested to explicitly verify 
that the recognized digits are the same as the digits 
spoken in step 115. If a "yes" is spoken in step 119, 
the positive response is recognized in branching step 
121, and the system proceeds to complete the trans- 
action, using all ten of the recognized digits, in step 
123. 

If a negative response is received from the caller 
in step 108 or 119, that response causes branching 
steps 111 or 121 to transfer control to steps 125 or 
127, respectively, in which a determination is made 
as to whether too many failed attempts have already 
been processed. This may be accomplished by 
initializing a counter when the process is begun, by 
incrementing the counter each time a "no" response is 
encountered in step 111 or 121, and by comparing the 
count in the counter to a predetermined threshold. If 
a negative response is indicated in steps 125 or 127, 
indicating that the threshold has not been exceeded, 
the process can be repeated, as by performing either 
steps 103-111 or 113-121 for additional recognition 
attempts. If a positive response is indicated in steps 
125 or 127, the automatic speech recognition has 
"failed", and the caller may be connected to a human 
attendant in step 126 or 128. 

The process illustrated in Fig. 1 produces the 
same treatment of the user, i.e., the same dialog 
between the user and the system, regardless of the 
confidence of the speech recognition accomplished in 
steps 106 and 116, and regardless of the historical 
details associated with previous verification attempts 
by the same user. This cumbersome, static approach 
is eliminated by the present invention, in favor of a 
dynamic approach which uses the confidence level 
associated with the speech recognition performed in 
steps 106 and 116 to alter the treatment afforded to 
the user. 

Specifically, referring now to Figs. 2 and 3, there 
is shown a flow diagram illustrating the steps followed 
in an ASR system arranged in accordance with the 
present invention, for responding to a person dialing 
a telephone number with voice input. In this 
exemplary process, the same transaction as described 
above is performed, namely, a transaction involving 
the dialing of a 10 digit telephone number having a 
three digit area code followed by a seven digit local 
telephone number. The process begins in step 201, 
when a caller is connected to a speech processing 
platform arranged to perform the same functions as 
described above, and, in addition, to provide an 
indication of the confidence level associated with the 
recognition being performed. The confidence level 
determination is described in more detail below. One 
exemplary technique for generating a confidence 
measure in connection with automatic speech 
recognition systems is described in an article 
entitled "Recognition Index: A Statistical Approach to 
Vocabulary Diagnostics" by K. P. Avila et al., Speech 
Technology, Oct-Nov 1987, Vol. 4, No. 1, Pages 62-67. 

In step 203, the user is prompted by an audible 
announcement to enter the area code for the tele- 
phone call, by speaking the three digits in step 205. 
In step 206, an automatic speech recognition process 
is performed, and a determination is made of the dig- 
its spoken by the caller. As before, the interpretation 
performed by the ASR process typically involves 
comparison of the user inputted speech with stored 
speech samples. However, the ASR system is also ar- 
ranged to provide a confidence value, which is an in- 
dication of the confidence level associated with the 
recognition. As illustrated in Fig. 2, the confidence 
analysis performed in step 231 can have two out- 
comes, designated as "very high confidence" or 
"moderate confidence". As explained below in con- 
junction with Fig. 5, more than two confidence levels 
can be used, and the definitions of the various levels 
can differ. 

If the confidence level determined in step 231 is 
"moderate confidence", the process continues in 
much the same way as described above. In particular, 
the system, in step 207, requests the caller to explic- 
itly verify that the recognized digits are, in fact, the 
same as the digits the user spoke in step 205. The 
user then responds with a "yes" or "no" answer in step 
208, and the system takes different action in branch- 
ing step 211, depending upon the response. In partic- 
ular, if the answer received in step 208 is "yes", indi- 
cating that the first three digits were correctly recog- 
nized, the process continues with step 213, in which 
the user is prompted for the remaining 7 digits of the 
telephone number. The user speaks these seven dig- 
its in step 215, and, in step 216, a determination is 
made of the digits spoken by the caller, again using 
the ASR process as in step 206. However, as in step 
231, the ASR system is arranged to provide an indi- 
cation of the confidence level associated with the rec- 
ognition. As illustrated in Fig. 3, the confidence ana- 
lysis performed in step 233 can have two outcomes, 
designated as "very high confidence" or "moderate 
confidence". If the outcome of step 233 represents 
"moderate confidence", the caller is requested in step 
217 to explicitly verify that the recognized digits are 
the same as the digits spoken in step 215. If a "yes" 
is spoken in step 218, the positive response is recog- 
nized in branching step 221, and the system proceeds 
to complete the transaction, using all ten of the rec- 
ognized digits, in step 223. 

In a manner similar to that used in Fig. 1, note that 
if a negative response is received from the caller in 
step 208 or 218, that response causes branching 
steps 211 or 221 to transfer control to steps 225 or 
227, respectively, in which a determination is made as 
to whether too many failed attempts have already 
been processed. If a negative response is indicated 
in steps 225 or 227, indicating that the threshold has 
not been exceeded, the process can be repeated, as 
by performing either steps 203-211 or 213-221 for ad- 
ditional recognition attempts. If a positive response is 
indicated in steps 225 or 227, the automatic speech 
recognition has "failed", and the caller may be con- 
nected to a human attendant in step 226 or 228. 

If the confidence analysis performed in steps 231 
or 233 indicates recognition with "very high 
confidence", a different treatment is given to the user. 
Specifically, if the first three digits are recognized 
with very high confidence, steps 207, 208 and 211 are 
skipped, so that the decision reached during speech 
recognition with respect to the first three digits is not 
explicitly verified. Then, if the next seven digits are 
also recognized with very high confidence, steps 217, 
218 and 221 are skipped, so that the decision reached 
during speech recognition with respect to the next 
seven digits is not explicitly verified. Therefore, it is 
seen that the process illustrated in Figs. 2 and 3 is 
adaptive, in that it produces a different dialog 
between the user and the system. The dialog is 
dependent upon the level of confidence of the speech 
recognition accomplished in steps 206 and 216. 
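The adaptive flow of Figs. 2 and 3 can be sketched for one segment of the number (the area code or the local number) as follows. This is only an illustrative sketch: the names Recognition, collect_segment, recognize and confirm are hypothetical, and the failure limit of 3 is an assumed value, since the text speaks only of "a predetermined threshold".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Recognition:
    digits: str      # best digit-string candidate from the recognizer
    very_high: bool  # True for "very high confidence" (steps 231/233)


def collect_segment(
    recognize: Callable[[], Recognition],
    confirm: Callable[[str], bool],
    max_failures: int = 3,  # assumed value, not given in the text
) -> Optional[str]:
    """Collect one digit segment, verifying only when confidence is moderate."""
    failures = 0
    while failures < max_failures:
        result = recognize()         # prompt, listen, run ASR
        if result.very_high:
            return result.digits     # verification steps are skipped
        if confirm(result.digits):   # explicit "yes"/"no" verification
            return result.digits
        failures += 1                # a "no" counts as a failed attempt
    return None                      # caller would go to a human attendant
```

A ten digit transaction would call collect_segment twice, once for the area code and once for the local number, and hand the call to an attendant if either call returns None.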

As shown in Fig. 4, a typical speech processing 
unit 301 can be arranged to be used in the context of 
a telecommunications network, as illustrated in Fig. 
1 of U.S. Patent 4,922,519 issued to A.N. Daudelin on 
May 1, 1990, which is incorporated herein by 
reference. Speech processing unit 301 includes a 
communications interface 311 which connects it to other 
system components via a trunk 315. Interface 311 and 
trunk 315 can support multiple simultaneous two-way 
conversations, so that a plurality of calls can be 
handled at any given time. The processes performed in 
speech processing unit 301 are controlled by a 
central processing unit (CPU) 303 which, in turn, 
operates under the control of stored programs contained 
in a memory such as database 309. Functionality 
which is available in speech processing unit 301 
includes (a) the ability, using a speech generator 307, 
to play announcements to a user, and (b) the ability, 
using ASR module 305, to interpret utterances 
received from a user. The sequencing of the 
announcements from speech generator 307 and the 
recognition operations performed in ASR module 305 
together constitute the user interface which is 
dynamically controlled in accordance with the present 
invention. The elements of speech processing unit 301 
are interconnected with each other and with 
communications interface 311 via a common bus 313. 

As stated above, the output from ASR module 
305 includes an interpretation of the utterance 
received from a user, as well as an indication of the 
confidence in the interpretation. The latter information 
is made available to CPU 303, so that the user interface 
process may be dynamically adapted based upon the 
confidence level. 

Speech processing unit 301 can be implemented 
using a Conversant MAP 100 Voice Response Unit 
available from AT&T, that is outfitted with a speech 
recognition package, and the control software stored 
in database 309 can be generated using an interac- 
tive tool known as a Script Builder. However, it is to 
be noted that the specific arrangement of speech 
processing unit 301 of Fig. 4 is illustrative only, and 
that other alternatives, such as those described in the 
references which are cited in the Daudelin patent, will 
be apparent to those skilled in the art. In particular, it 
is to be understood that while the processes descri- 
bed in connection with Fig. 1 and Figs. 2 and 3 relate 
to the use of speech recognition in the context of mak- 
ing telephone calls, speech recognition can also be 
used in a "local" process, such as when a user inter- 
acts with a computer or an appliance. A dishwasher 
or a personal computer can be arranged to respond 
to verbal commands by incorporating an automatic 
speech recognition unit in the apparatus. In accor- 
dance with the invention, the computer may be ar- 
ranged to format a disk in response to the recognition 
of a verbally uttered "format" command. Since format- 
ting is a serious operation that may result in the loss 
of data, the command is executed only if the com- 
mand is recognized with a very high degree of confi- 
dence. If the confidence level is moderate, the user 
may be asked to explicitly verify the command by say- 
ing the word "yes" or by repeating the command. If the 
confidence level is low, the user may be required to 
type the command on the keyboard. In such a local 
arrangement, communications interface 311 would 
be connected to a speech input device, such as a mi- 
crophone, and an output device such as a speaker or 
a display panel. 

Referring now to Fig. 5, another embodiment of 
the invention is illustrated by a different user inter- 
face process. In this embodiment, a user is prompted 
for a speech input in step 400, and after the ASR mod- 
ule 305 interprets the user's utterance in step 401, the 
confidence in the interpretation is determined in step 
403, and then assessed in terms of three possible lev- 
els, and three different interactions with the user may 
then occur. First, if the interpretation has a very high 
likelihood of being correct, a positive result is reached 
in step 405, and the ASR interpretation is accepted 
without explicit verification in step 407, despite the 
possibility of an occasional error. The transaction is 
then completed in step 409. Second, for an intermedi- 
ate level of uncertainty, a positive result is reached in 
step 411, whereupon the user is asked to explicitly 
verify (or dis-verify) the result in step 413, because 
this may offer an advantage over forcing the user to 
re-enter the information (by voice or otherwise). If the 
result is verified, a positive result occurs in step 415, 
and the transaction is completed in step 409. If the re- 
sult is not verified, a negative result occurs in step 
415, and the user is required to repeat the process, 
beginning with step 400, provided that too many failed 
attempts have not occurred, as determined in step 
417. Third, where the uncertainty is large, and/or the 
consequence of misinterpretation is severe, the re- 
sults of both steps 405 and 411 are negative. This 
condition is treated as a "failure to interpret", and the 
user may be required to "try again" without attempt- 
ing an explicit verification of the (possibly) wrong re- 
sult. This is achieved by repeating the process begin- 
ning at step 400, again provided that the user has not 
failed too many times, as indicated in step 417. If too 
many failures have occurred, the process of Fig. 5 
ends in step 419, whereupon the user may, in the con- 
text of a telephone call, be connected to a live attend- 
ant. 
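The three-way process of Fig. 5 can be sketched as below. This is a minimal sketch under stated assumptions: the names Action, choose_action and run_dialog are hypothetical, the confidence is taken to be a single number where larger means more certain, and the two thresholds and the failure limit are parameters whose values the text does not specify.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Action(Enum):
    ACCEPT = "accept"  # step 407: accept without explicit verification
    VERIFY = "verify"  # step 413: ask the user to verify the result
    RETRY = "retry"    # "failure to interpret": return to step 400


def choose_action(confidence: float, accept_at: float, verify_at: float) -> Action:
    """Map a confidence value onto the three outcomes of Fig. 5."""
    if confidence >= accept_at:   # step 405 positive
        return Action.ACCEPT
    if confidence >= verify_at:   # step 411 positive
        return Action.VERIFY
    return Action.RETRY           # steps 405 and 411 both negative


def run_dialog(
    recognize: Callable[[], Tuple[str, float]],
    verify: Callable[[str], bool],
    accept_at: float,
    verify_at: float,
    max_failures: int = 3,  # assumed limit for step 417
) -> Optional[str]:
    failures = 0
    while failures < max_failures:
        text, confidence = recognize()   # steps 400 to 403
        action = choose_action(confidence, accept_at, verify_at)
        if action is Action.ACCEPT:
            return text                  # transaction completes (step 409)
        if action is Action.VERIFY and verify(text):
            return text                  # verified in step 415
        failures += 1                    # rejected or not interpretable
    return None                          # step 419: e.g. live attendant
```

One way to reflect the "consequences" of misinterpretation would be to pass stricter thresholds for a severe operation, such as the "format" command discussed above, than for a harmless one; that mapping is a design choice left open here.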

The confidence analysis performed in steps 231 
and 233 of Figs. 2 and 3, and performed in steps 405 
and 411 of Fig. 5, can be accomplished by assigning 
a proximity score for each uttered digit to each digit 
model for which the ASR system has been trained, 
where a large score indicates good correspondence 
and a small score indicates poor correspondence. 
This approach creates a confidence value vector for 
each spoken digit that indicates the proximity of that 
utterance to each model. We have observed that it is 
more likely that the option with the closest proximity 
is the correct choice whenever the magnitude of the 
confidence value of the next closest proximity is much 
smaller. Thus, a function of the difference between 
these two proximity scores is used to determine the 
confidence level that the "best" choice interpretation 
of an utterance is indeed the "correct" choice. 
Confidence level determination can be accomplished 
using many alternatives, all of which use the specific 
data from an ASR system to distinguish utterances 
that are likely to be correct from those that are less 
likely. From this perspective, a particular error rate 
can be viewed as being derived from a universe that 
contains x% of numbers that contain <a% errors (and 
can be viewed as not very error prone) and y% of 
numbers that contain >b% and <c% errors (a more 
error prone set) and z% of numbers that contain >c% 
errors (a set deemed unlikely to be correct). 
Experiments with the ASR system and known speech 
samples can be used to determine which specific 
values should be used for x, y and z, and a, b and c. 

It is also to be noted here that the relative 
"proximity" of two possible outcomes of a speech 
recognition task can be characterized in different 
ways. The ratio or linear difference in scores may be 
used, or some more complex function may be employed. 
The specific determination of "proximity" that is 
optimal will depend on the nature of the particular 
model being used and the algorithm that computes the 
similarity measure. Other variables may also be 
involved. 
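
The two simplest characterizations mentioned above, the linear difference and the ratio of the two best proximity scores, can be sketched as follows. The function names and the example scores are hypothetical; the sketch assumes a mapping from model name to proximity score in which larger means closer, and positive scores for the ratio form.

```python
from typing import Mapping, Tuple


def top_two(scores: Mapping[str, float]) -> Tuple[float, float]:
    """Return the best and next-best proximity scores."""
    if len(scores) < 2:
        raise ValueError("need scores for at least two models")
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0], ranked[1]


def margin_confidence(scores: Mapping[str, float]) -> float:
    """Linear difference between the first and second choices."""
    best, second = top_two(scores)
    return best - second


def ratio_confidence(scores: Mapping[str, float]) -> float:
    """Ratio of the first choice to the second (assumes positive scores)."""
    best, second = top_two(scores)
    return best / second


# Example with made-up scores for one uttered digit.
example = {"eight": 90.0, "three": 30.0, "nine": 10.0}
print(margin_confidence(example), ratio_confidence(example))
```

Either value can then be compared against thresholds chosen by experiment to assign a confidence level.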

In accordance with the present invention, 
historical details, such as a success measure associated 
with previous verification attempts of the same user, 
can be used to dynamically alter or adapt the ASR 
process and the manner in which the ASR system 
interacts with the user, since not all users of ASR 
systems experience the same success levels or 
generate the same confidence levels. The labels "sheep" 
and "goats" can be used to describe this arrange- 
ment, namely that the ASR process used for some 
people (i.e., "sheep") for whom the process works 
well, is different from the process used for other peo- 
ple (i.e., "goats") for whom the process works poorly. 
Clearly, when an ASR system introduces an explicit 
verification step in the user interface, it improves the 
system performance for goats in that fewer errors are 
permitted to occur. At the same time, it degrades the 
quality of the interface for all users by introducing the 
extra interaction, and the sheep (whose speech is 
generally understood by the system), have less of a 
need for that step. 

The use of a historical "success measure" per- 
mits accommodation of both types of users, because 
the "success measure" permits differentiation be- 
tween users that are likely to be sheep and those who 
are likely to be goats. Determination or prediction of 
which individuals are "ASR sheep" is possible when 
ASR processing is used in connection with a sub- 
scriber-based service, where the same users are in- 
volved over a period of time. In such services, it is 
quite easy to track, for a given user, how often the 
ASR system returns a high confidence score and/or 
how often a particular user is successful, with or with- 
out explicit verification. Users who consistently re- 
ceive high confidence scores and/or who consistently 
succeed are "presumed sheep". For these users, the 
verification step can be dispensed with, even if the 
confidence level is not "very high" on some occa- 
sions. Indeed, for persons for whom the ASR system 
has historically performed well, a moderate confi- 
dence level can lead the process to skip explicit ver- 
ification and dispense with steps 207, 208 and 211 
and/or steps 217, 218 and 221 in Figs. 2 and 3, and 
to dispense with steps 413 and 415 in Fig. 5. For users 
who have a large success measure, those steps 
would thus only be performed when the results in step 
231 or 233 produced a "low" confidence level, or 
when the results of both steps 405 and 411 were neg- 
ative. Note here that in some implementations in 
which historical information cannot be obtained, such 
as when a new user operates a computer using voice 
commands, it is not feasible to compare historical 
user utterances with ASR recognition and to track 
how often recognition is successful. 
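
The use of a per-user success measure to relax verification can be sketched as follows. This is an illustrative sketch only: UserHistory and needs_verification are hypothetical names, the success measure is assumed to be the simple fraction of successful attempts, and the minimum of 10 attempts and the 0.9 cutoff are assumed values that would in practice be set by experiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserHistory:
    attempts: int = 0
    successes: int = 0

    def record(self, success: bool) -> None:
        """Track one ASR transaction for this user."""
        self.attempts += 1
        self.successes += int(success)

    def success_measure(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0


def needs_verification(
    level: str,                     # "very high", "moderate" or "low"
    history: Optional[UserHistory],
    min_attempts: int = 10,         # assumed: enough data to trust the measure
    min_success: float = 0.9,       # assumed cutoff for a "presumed sheep"
) -> bool:
    if level == "very high":
        return False                # never verified, as in Figs. 2 and 3
    if level == "moderate":
        trusted = (
            history is not None
            and history.attempts >= min_attempts
            and history.success_measure() >= min_success
        )
        return not trusted          # presumed sheep skip verification
    return True                     # "low": verify even for presumed sheep
```

A new user, for whom no history exists, is passed as None and is treated exactly as in the process of Figs. 2 and 3.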

The historical information needed to differentiate 
between various classes of users can be stored in da- 
tabase 309 of Fig. 4 and retrieved in response to an 
individual's access to speech processing unit 301. For 
example, the user can be identified by automatic 
number identification (ANI) information which is pre- 
sented to an originating switch when a telephone call 
is originated from a telephone station served by that 
switch. Alternatively, the user can be identified by a 
personal identification number (PIN) that is provided 
by the user as part of the ASR process. In either 
event, the ANI or PIN is used as a retrieval key to as- 
certain information from the database indicating if a 
particular user is one for whom the process should be 
changed, and, if so, how it should be changed. In es- 
sence, the system can thus determine if the user is 
a sheep or a goat. 

The present invention was simulated in a test 
which collected a 10-digit telephone number in two 
parts, a 3-digit area code and a 7-digit local number, 
using Automatic Speech Recognition on an AT&T 
Conversant System. In this experiment, confidence 
measures of digit string candidates were used to 
improve the user interface, so that the explicit 
verification steps were not performed when the first 
digit string candidate received a much higher 
confidence score than the second digit string 
candidate. Specifically, an AT&T Conversant System 
was arranged to assign a confidence value between 1 
and 1,000,000 to each of up to four possible digit 
string candidates. The candidate with the highest 
confidence value was called the "first candidate"; the 
candidate with the second highest confidence value 
was called the "second candidate"; and so on. The 
system calculated the difference in confidence values 
between the first and second candidates in order to 
determine a confidence level in the ASR result, and 
then used this difference to adjust the overall process 
in terms of which explicit verification prompts were or 
were not played, and which steps in the process were 
skipped. If the difference between candidate #1 and 
candidate #2 was greater than 6000, it was assumed 
that the confidence was high enough to alter the 
process and skip the explicit verification steps. In 
those transactions where confidence score difference 
was less than 6000, a dialog of the following type 
occurred, where S: represents the system prompt, and 
U: represents the user input: 

S: Please say just the area code that you would like 
to call, now. 
U: Nine, Zero, Eight. 
S: Did you say Nine, Zero, Eight? 
U: Yes. 
S: Please say the 7-digit telephone number that you 
would like to call, now. 
U: Nine, Four, Nine, Six, Five, One, Zero. 
S: Did you say Nine, Four, Nine, Six, Five, One, Zero? 
U: Yes. 
S: Thank you... 

On the other hand, if the confidence score 
difference was greater than 6000, a dialog of the 
following type occurred: 

S: Please say just the area code that you would like 
to call, now. 
U: Nine, Zero, Eight. 
S: Please say the 7-digit telephone number that you 
would like to call, now. 
U: Nine, Four, Nine, Six, Five, One, Zero. 
S: Thank you... 
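
The difference between the two dialogs comes down to one comparison per segment, which can be sketched as below. The function names and the example confidence values are hypothetical; the threshold of 6000 is the value reported for the experiment. The text does not say what happened when the difference was exactly 6000, so this sketch assumes that case is verified.

```python
from typing import List

DIGIT_NAMES = ["Zero", "One", "Two", "Three", "Four",
               "Five", "Six", "Seven", "Eight", "Nine"]


def spoken(digits: str) -> str:
    """Render "908" as "Nine, Zero, Eight" for a read-back prompt."""
    return ", ".join(DIGIT_NAMES[int(d)] for d in digits)


def system_turns(prompt: str, digits: str, first_score: int,
                 second_score: int, threshold: int = 6000) -> List[str]:
    """System prompts played for one segment of the number."""
    turns = [prompt]
    # Verify unless the first candidate beats the second by more than
    # the threshold (a difference equal to the threshold is verified
    # here by assumption).
    if first_score - second_score <= threshold:
        turns.append(f"Did you say {spoken(digits)}?")
    return turns


area_prompt = "Please say just the area code that you would like to call, now."
print(system_turns(area_prompt, "908", 500000, 499000))  # with read-back
print(system_turns(area_prompt, "908", 900000, 100000))  # read-back skipped
```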

ASR performance and preference data that were 
collected showed that the user interface that dynam- 
ically used confidence scores to adapt the verifica- 
tion process was better than the conventional user in- 
terface. The average time to complete telephone 
number transactions was decreased by about 25 per- 
cent; users preferred the system that used confi- 
dence scores; and the percentage of "wrong number" 
calls was not increased. Similar findings were ob- 
served for other process adjustments based on con- 
fidence scores. 

With respect to use of historical data as a success 
measure in determining the user interface in process- 
ing of ASR samples, subjects were divided into two 
groups. One group, the presumed sheep, was de- 
fined as those users for whom the recognizer had 
high confidence in at least 60% of the transactions 
(where users made up to 32 ASR attempts). The other 
group, the presumed goats, constituted the remain- 
der. For each user group, the overall ASR accuracy 
was compared with accuracy for those transactions 
where the recognizer showed "high confidence" (de- 
fined as a confidence difference score > 6000). It was 
found that the overall ASR performance showed a 
success rate of 83.8 percent. However, if only those
transactions where ASR confidence was high were 
considered, a 97.5 percent success was found, indi- 
cating that on these occurrences there is less of a 
need to have the user confirm the result as was noted 
earlier. However, recognizer accuracy can also be 
considered for just the ASR presumed sheep during 
"high confidence transactions." The data show that 
for these users, the ASR system achieves an ex- 
tremely high performance, with 406 successes in 407 
attempts for an accuracy rate of 99.8 percent. 
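
The grouping rule and the accuracy arithmetic described above can be illustrated with a short Python sketch. The function names are hypothetical, and the per-transaction confidence differences passed to is_presumed_sheep are invented inputs; only the 60% criterion, the 6000 threshold and the 406-of-407 figure come from the text.

```python
from typing import Sequence


def is_presumed_sheep(confidence_differences: Sequence[float],
                      threshold: float = 6000,
                      min_fraction: float = 0.60) -> bool:
    """Classify a user from the confidence differences of past transactions.

    A user is a presumed sheep when at least `min_fraction` of the
    transactions were "high confidence" (difference > threshold).
    """
    if not confidence_differences:
        # Assumption: a user with no history is not treated as a sheep.
        return False
    high = sum(1 for d in confidence_differences if d > threshold)
    return high / len(confidence_differences) >= min_fraction


def accuracy_percent(successes: int, attempts: int) -> float:
    """Success rate as a percentage, rounded to one decimal place."""
    return round(100.0 * successes / attempts, 1)


# Figure reported for presumed sheep in high-confidence transactions.
print(accuracy_percent(406, 407))  # 99.8
```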

In short, these experiments showed that there 
are some users for whom the recognizer shows high 
confidence frequently. For such individuals, when 
confidence is high, the recognizer is virtually always 
correct. In those situations where these presumed 
sheep can be identified, an optimal ASR user interface
can be defined: one that permits transactions
to be completed as fast as, or faster than, speaking
with a live attendant. This may require making real- 
time call flow decisions based on recognizer confi- 
dence scores and/or on a subscriber's stated ASR 
preferences or system usage history. The general 
point, however, is that the user interface should rec- 
ognize the different needs of the goats and sheep. 
While most current systems are optimized only for 
goats, it is possible to optimize the call flows for both 
sheep and goats. 
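
One way such a real-time call flow decision might be organized is sketched below in Python. It is a hypothetical policy, not one specified in the text: it skips verification only for presumed sheep in high-confidence transactions, and it lets a stated subscriber preference for verification override everything else. UserHistory, choose_call_flow and the flow labels are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class UserHistory:
    """Per-user usage history (hypothetical structure)."""
    attempts: int = 0
    high_confidence: int = 0

    def record(self, difference: float, threshold: float = 6000) -> None:
        """Log one transaction's first-choice/second-choice score difference."""
        self.attempts += 1
        if difference > threshold:
            self.high_confidence += 1

    def is_presumed_sheep(self, min_fraction: float = 0.60) -> bool:
        """True when at least `min_fraction` of past transactions were high confidence."""
        return (self.attempts > 0
                and self.high_confidence / self.attempts >= min_fraction)


def choose_call_flow(history: UserHistory,
                     difference: float,
                     prefers_verification: bool = False,
                     threshold: float = 6000) -> str:
    """Pick a call flow for the current utterance.

    Assumed policy: a stated preference wins; otherwise verification is
    skipped only for a presumed sheep whose current result is high confidence.
    """
    if prefers_verification:
        return "verify"
    if difference > threshold and history.is_presumed_sheep():
        return "skip_verification"
    return "verify"
```

Other policies are equally consistent with the text, for example skipping verification for any high-confidence transaction regardless of history, as in the telephone number experiment described earlier.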

Various changes may be made in the present invention
by those of ordinary skill in the art. Accordingly,
the invention should be limited only by the appended
claims.



Claims 

1. A system for adapting the user interface in systems
that accept speech input and perform automatic
speech recognition (ASR), comprising

means for receiving an utterance;
means for processing said utterance using
ASR to generate an interpretation of said utterance
and to determine a level of confidence in
said interpretation; and

means for selectively adapting the verification
of said interpretation requested from the
user as a function of said confidence level.

2. The invention defined in claim 1 wherein said
processing means is arranged to determine at
least first and second interpretations for said utterance,
said interpretations having respective
associated first and second confidence values,
and

wherein said confidence level is determined
as a function of the relative magnitudes of
said first and second confidence values.

3. The invention defined in claim 1 wherein said
system further includes

means for storing, for each user of said
system, information representing a success
measure computed as a function of previous
uses of said system, and

means for retrieving information from said
storing means and for adapting said user interface
as a function of the value of said success
measure.

4. The invention defined in claim 3 wherein said 
success measure includes the previous success 
rate for said each user of said system. 

5. The invention defined in claim 3 wherein said
success measure includes previous confidence 
values associated with ASR interpretations for 
said each user. 

6. The invention defined in claim 3 wherein said
system is arranged to compare said success 
measure to a user dependent threshold. 

7. The invention defined in claim 1 wherein said last
mentioned means is arranged to adapt said verification
as a function of the consequences of an
error in said interpretation.

8. An automatic speech recognition system comprising

means for generating at least first and second
interpretations of a user's utterance and respective
first and second confidence values for
said interpretations, and

means operative in response to the relative
magnitudes of said first and second confidence
values, for prompting said user to verify
said first interpretation prior to accepting said
first interpretation as an accurate representation
of said utterance.

9. The invention defined in claim 8 wherein said
system further includes means for prompting
said user with information including said first
interpretation.

10. An automatic speech recognition system comprising

means for generating an interpretation of a
user's utterance and a confidence value for said
interpretation, and

user interface means operative in response
to the magnitude of said confidence value,
for (a) requesting said user to verify said interpretation
prior to accepting said interpretation
as an accurate representation of said utterance,
or (b) accepting said interpretation as an accurate
representation of said utterance without verification.

11. The invention defined in claim 10 wherein said
system further includes means for storing information
indicative of the previous success of said
system in interpreting utterances of said user,
and

means responsive to said stored information
for controlling said user interface means.

12. A method of adapting the user interface in systems
that accept speech input and perform automatic
speech recognition (ASR), comprising the
steps of

receiving an utterance;
processing said utterance using ASR to
generate an interpretation of said utterance and
to determine a level of confidence in said interpretation;
and

selectively adapting the verification of
said interpretation requested from the user as a
function of said confidence level.

13. The method defined in claim 12 wherein said
processing step includes determining at least first
and second interpretations for said utterance,
said interpretations having respective associated
first and second confidence values, and

determining confidence level as a function
of said first and second confidence values.

14. The method defined in claim 12 further including

storing, for each user of said system, information
representing a success measure computed
as a function of previous uses of said system,
and

retrieving information and altering the user
interface as a function of the value of said success
measure.

15. The method defined in claim 14 wherein said success
measure includes the previous success rate
for said each user of said method.

16. The method defined in claim 14 wherein said success
measure includes previous confidence values
associated with ASR interpretations for said
each user.

17. The method defined in claim 14 wherein said 
method further includes comparing said success
measure to a user dependent threshold. 

18. A method of automatic speech recognition com- 
prising the steps of 

generating at least first and second inter- 
pretations of a user's utterance and respective 
first and second confidence values for said inter- 
pretations, and 

in response to the relative values of said 
first and second confidence values, prompting 
said user to verify said first interpretation prior to 
accepting said first interpretation as an accurate 
representation of said utterance. 

19. The method defined in claim 18 wherein said 
method further includes prompting said user with
information including said first interpretation. 

20. A method for performing automatic speech recognition
comprising the steps of

generating an interpretation of a user's ut- 
terance and a confidence value for said interpre- 
tation, and 

adapting the operation of a user interface 
in response to the magnitude of said confidence 
value, by (a) requesting said user to verify said in- 
terpretation prior to accepting said interpretation 
as an accurate representation of said utterance, 
(b) accepting said interpretation as an accurate 
representation of said utterance without verifica- 
tion, or (c) rejecting said interpretation and re- 
questing said user to provide a new utterance. 

21. The method defined in claim 20 wherein said 
method further includes the steps of storing infor- 
mation indicative of the previous success of said 
system in interpreting utterances of said user, 
and 

adapting said user interface in response to 
said stored information.



FIG. 5 (flowchart; recoverable step labels: BEGIN; PROMPT FOR SPEECH INPUT; PERFORM AUTOMATIC SPEECH RECOGNITION; COMPUTE CONFIDENCE MEASURE; CONFIDENCE HIGH?; ACCEPT ASR INTERPRETATION; TRANSACTION COMPLETED?; CONFIDENCE MODERATE?; FAILED TOO [illegible]?; EXIT; legible reference numerals: 401, 407, 409, 411, 419)






EP 0 651 372 A3 



European Patent Office

EUROPEAN SEARCH REPORT

Application Number: EP 94 30 7658

DOCUMENTS CONSIDERED TO BE RELEVANT

Citation of document with indication, where appropriate, of relevant passages:

US 5 305 244 A (NEWMAN EDWARD G ET AL) 19 April 1994
* column 17, line 42 - column 18, line 13; figure 10 *

idem

EP 0 440 439 A (NIPPON ELECTRIC CO) 7 August 1991
* abstract *
* column 4, line 29 - line 59 *

US 5 033 088 A (SHIPMAN DAVID W) 16 July 1991
* column 2, line 45 - line 56 *
* column 3, line 14 - line 25 *

EP 0 538 626 A (IBM) 28 April 1993
* abstract; tables 3,4 *
* page 4, line 11 - line 17 *

Relevant to claim (entries in printed order): 1,10,12,20; 8,18; 1,8,10,12,18,20; 1,8,10,12,18,20; 8,13

CLASSIFICATION OF THE APPLICATION (Int.Cl.6): G10L3/00; G10L5/06

TECHNICAL FIELDS SEARCHED (Int.Cl.6): G10L; H04M

The present search report has been drawn up for all claims

Place of search: THE HAGUE
Date of completion of the search: 7 April 1997
Examiner: Krembel, L

CATEGORY OF CITED DOCUMENTS

X : particularly relevant if taken alone
Y : particularly relevant if combined with another document of the same category
A : technological background
O : non-written disclosure
P : intermediate document
T : theory or principle underlying the invention
E : earlier patent document, but published on, or after the filing date
D : document cited in the application
L : document cited for other reasons
& : member of the same patent family, corresponding document



